Objective
To analyze the chemical properties of red wine and determine which factors might be responsible for good-quality red wine.
Introduction
Looking at the variables in the dataset, there are some really interesting questions that can be answered.
- Does a high alcohol content increase the quality of red wine?
- Does a high sugar content make red wine tastier and hence result in a higher-quality product?
Let’s explore the data to answer the above questions and to understand the data visually.
We will plot graphs, identify outliers, and draw inferences about the data from the various plots (histograms, scatterplots, bar plots, etc.).
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
## [1] 132
Observations from summary:
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
# Convert quality to factor
wine$quality.as_factor <- factor(wine$quality)
# Prepare data (quality) for categorization
levels(wine$quality.as_factor)
## [1] "3" "4" "5" "6" "7" "8"
## 3 4 5 6 7 8
## 10 53 681 638 199 18
As we can see, most of the data points fall in the 5-6 range, which we will categorize as medium quality. This is done in a later section.
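The share of samples in the 5-6 range can be checked directly from the counts printed above. The snippet below is a minimal sketch, not the code used in this analysis: it rebuilds a `quality` vector from those counts as a stand-in for `wine$quality`.

```r
# Rebuild the quality scores from the counts shown in the table above;
# `quality` is a stand-in for wine$quality.
counts  <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)
quality <- rep(as.integer(names(counts)), times = counts)

# Frequency table and the percentage of samples rated 5 or 6
quality_table <- table(quality)
medium_share  <- round(mean(quality %in% c(5, 6)) * 100)
medium_share  # 82
```

So roughly four out of five samples belong to what is later called the medium category.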
Apart from the initial spike, the distribution appears bimodal. As we discovered in the introduction section, the spike is due to many samples having no citric.acid at all, i.e. value = 0.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
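A spike at zero like this can be quantified by counting the zero-valued samples. The sketch below uses a small made-up data frame (`wine_toy` is hypothetical); on the real data the same expressions would be applied to `wine$citric.acid`.

```r
# Toy stand-in for the wine data frame (values are made up for illustration)
wine_toy <- data.frame(citric.acid = c(0, 0, 0.26, 0.49, 0, 0.31, 0.02, 0.66))

# Number and share of samples that contain no citric acid at all
n_zero     <- sum(wine_toy$citric.acid == 0)
share_zero <- n_zero / nrow(wine_toy)

n_zero      # 3
share_zero  # 0.375
```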
The count drops as the alcohol percentage increases, which means there are fewer samples with higher alcohol content. We will uncover in the final plots section whether this is a good sign ;)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
We can see some clear outliers in this distribution; apart from that, no clear pattern stands out. A log10 transformation would make the shape clearer.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
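Outliers that are spotted by eye can also be flagged numerically. The following is a minimal sketch using Tukey's 1.5 * IQR rule on made-up values (`sulphates_toy` is hypothetical and only mimics the shape of the sulphates column); it is one common convention, not necessarily the criterion used for the plots here.

```r
# Toy sulphates-like values (made up); 2.0 plays the role of the extreme sample
sulphates_toy <- c(0.33, 0.50, 0.55, 0.58, 0.62, 0.65, 0.70, 0.73, 0.80, 2.00)

# Tukey's rule: flag points beyond 1.5 * IQR from the quartiles
q     <- quantile(sulphates_toy, probs = c(0.25, 0.75))
fence <- 1.5 * (q[[2]] - q[[1]])
is_outlier <- sulphates_toy < q[[1]] - fence | sulphates_toy > q[[2]] + fence

sulphates_toy[is_outlier]  # 2
```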
Most data points lie between 6 and 8, and the distribution looks roughly normal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
The variation in density is very small (0.9901 to 1.004). It may seem that it will not have a large effect on quality; we will look at the plot in the bivariate analysis.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
The pH mostly lies between 3 and 4, with a small number of outliers outside that range.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
This is a positively skewed distribution with a huge difference between the max and min values.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
The above plot shows a multimodal distribution for volatile.acidity, with most data points in the range 0.4 to 0.8.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
The data points for free.sulfur.dioxide are not evenly distributed, with some outliers beyond 60. The distribution is positively skewed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
This distribution has a long tail, but most values lie between 0.0 and 0.2.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
Because of the very long tail, it is better to apply a log10 transformation, which gives a clearer view.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
The table below summarizes the above findings on the distributions:
| Normal | Non-normal |
|---|---|
| alcohol | citric.acid |
| fixed.acidity | residual.sugar |
| volatile.acidity | free.sulfur.dioxide |
| density | total.sulfur.dioxide |
| pH | sulphates |
| - | chlorides |
So, we need to transform the variables that do not look normal or close to it. We can use log10.
free.sulfur.dioxide: the distribution is multimodal; the log10 transformation has been really helpful here.
residual.sugar, chlorides, sulphates, total.sulfur.dioxide: the data points are more evenly distributed after the transformation.
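To see why log10 helps with long right tails, the sketch below compares a simple skewness measure before and after the transformation. The data are simulated (a log-normal sample standing in for a variable such as residual.sugar) and `skewness()` is a hypothetical helper written for this illustration, so this only demonstrates the idea rather than reproducing the plots above.

```r
set.seed(42)
# Toy right-skewed variable standing in for e.g. residual.sugar (simulated)
x <- rlnorm(1000, meanlog = 1, sdlog = 0.6)

# Simple moment-based skewness (hypothetical helper, not from the analysis)
skewness <- function(v) {
  centred <- v - mean(v)
  mean(centred^3) / mean(centred^2)^1.5
}

skew_raw <- skewness(x)         # clearly positive: long right tail
skew_log <- skewness(log10(x))  # close to zero after the transformation
```

In a ggplot2 histogram, the same transformation is usually applied to the axis with `scale_x_log10()`.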
Let’s further classify wine quality into three ordinal levels: low (3, 4), medium (5, 6), best (7, 8).
low <- wine$quality <= 4
medium <- wine$quality > 4 & wine$quality < 7
best <- wine$quality > 6
wine$quality.category <- factor(ifelse(low, 'low',
                                       ifelse(medium, 'medium', 'best')),
                                levels = c("low", "medium", "best"))
levels(wine$quality.category)
## [1] "low" "medium" "best"
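The same binning can be written more compactly with base R's `cut()`. This is an alternative sketch, not the approach used in this report; it runs on a toy `quality` vector, and `ordered_result = TRUE` is an added choice so that the result is an ordered factor.

```r
# Toy quality scores covering every level in the data (3 to 8)
quality <- 3:8

# cut() bins with right-closed intervals: (-Inf,4], (4,6], (6,Inf]
quality_category <- cut(quality,
                        breaks = c(-Inf, 4, 6, Inf),
                        labels = c("low", "medium", "best"),
                        ordered_result = TRUE)

as.character(quality_category)
# "low" "low" "medium" "medium" "best" "best"
```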
ggplot(aes(x = quality.category), data = wine) +
  geom_bar(stat = 'count', color = 'black', fill = 'grey')
The above plot shows that most data points are in the medium category, and hence most of the plots we see will have medium data points overplotted.
We can use the ggpairs function to find the level of correlation between the variables, so that those with very little relevance can be skipped.
volatile.acidity seems to have some strong correlations; we will plot it against most of the chemical compounds and see what effect it has on the quality of wine.
Comparing each of the attributes/compounds of wine with quality and drawing conclusions
Let’s draw plots for all variables against quality of red wine and analyze which variables have a relation (positive, negative, or none) with quality.
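One way to draw several variables against quality in a single figure is to reshape the data to long format and facet by variable. The sketch below assumes ggplot2 (already used in this report) and runs on simulated data; `wine_toy`, `vars`, and `long` are hypothetical names, and the actual plots in this section may have been produced differently.

```r
library(ggplot2)

set.seed(1)
# Toy stand-in for the wine data (simulated values, three variables only)
wine_toy <- data.frame(
  quality          = sample(3:8, 200, replace = TRUE),
  alcohol          = rnorm(200, mean = 10.4, sd = 1),
  sulphates        = rnorm(200, mean = 0.66, sd = 0.15),
  volatile.acidity = rnorm(200, mean = 0.53, sd = 0.18)
)

# Reshape to long format: one row per (sample, variable) pair
vars <- c("alcohol", "sulphates", "volatile.acidity")
long <- stack(wine_toy[vars])            # columns: values, ind
long$quality <- factor(rep(wine_toy$quality, times = length(vars)))

# One boxplot panel per variable, each with its own y scale
p <- ggplot(long, aes(x = quality, y = values)) +
  geom_boxplot() +
  facet_wrap(~ ind, scales = "free_y") +
  labs(x = "quality", y = "value")
p
```

A rising median across the quality levels in a panel hints at a positive relation, a falling one at a negative relation.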
Conclusion: The above factors are positively related to the quality of red wine, i.e. the higher the factor, the better the quality.
Let’s check the remaining ones
Conclusion: The above factors are inversely related to the quality of red wine, i.e. the lower the factor, the better the quality.
We should not miss out on pairs of variables, other than quality, that have a high correlation coefficient.
The scatterplot for alcohol and sulphates shows that low-quality samples are mostly in the bottom-left corner and best-quality samples in the top right, which suggests that the levels of these two compounds are informative about the quality of the wine.
volatile.acidity is used here as a check to confirm that alcohol and sulphates are compounds that increase the quality of wine. A more detailed plot is shown in the final plots section.
The fixed.acidity and pH plot shows that the lower-right corner, with higher fixed.acidity and lower pH, contains the best wines. This is discussed further in the final plots section.
Let’s try to view the relation between residual.sugar and alcohol; sweetness versus bitterness :)
We can see that alcohol is unaffected by density or residual sugar.
The criterion for picking variables for further analysis is a correlation coefficient > 0.5 or < -0.5.
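The 0.5 cut-off can be applied mechanically to a correlation matrix. The following is a minimal sketch on simulated data (`toy` and `strong_pairs` are hypothetical names, and Pearson correlation is assumed); on the real data, `cor()` would be called on the numeric columns of `wine`.

```r
set.seed(7)
# Toy data with one strongly related pair (simulated, not the wine data)
n <- 300
fixed.acidity <- rnorm(n, mean = 8.3, sd = 1.7)
toy <- data.frame(
  fixed.acidity = fixed.acidity,
  pH            = 3.9 - 0.07 * fixed.acidity + rnorm(n, sd = 0.08),
  chlorides     = rnorm(n, mean = 0.087, sd = 0.04)
)

# Pearson correlation matrix, keeping each pair once (upper triangle)
r   <- cor(toy)
idx <- which(upper.tri(r) & abs(r) > 0.5, arr.ind = TRUE)

strong_pairs <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = r[idx]
)
strong_pairs
```

Only the simulated fixed.acidity/pH pair passes the filter here, with a negative coefficient.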
It is worth exploring if there is any relation between residual.sugar and alcohol
This indicates that as volatile.acidity and citric.acid increase, the quality of the wine drops sharply.
We can see that the blue points, i.e. best-quality samples, lie above the correlation line. Let’s verify whether this holds across all quality levels, since I can see a lot of red dots below the line.
As we can see above, the slope in the initial plot was correct and the relation is positive, i.e. for higher values of citric acid and fixed acidity, the perceived quality of wine is better.
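The visual check across quality levels can be backed up by computing the correlation within each category. This is a sketch on simulated data in which the relation is positive in every group by construction; `toy` and `r_by_category` are hypothetical names, and the real check would use `wine$quality.category`.

```r
set.seed(3)
# Toy data: citric.acid rises with fixed.acidity in every quality group (simulated)
n <- 300
toy <- data.frame(
  quality.category = factor(sample(c("low", "medium", "best"), n, replace = TRUE),
                            levels = c("low", "medium", "best")),
  fixed.acidity    = rnorm(n, mean = 8.3, sd = 1.7)
)
toy$citric.acid <- 0.08 * toy$fixed.acidity + rnorm(n, sd = 0.1)

# Correlation of the two variables within each quality category
r_by_category <- sapply(split(toy, toy$quality.category), function(d) {
  cor(d$fixed.acidity, d$citric.acid)
})
r_by_category
```

If the sign of the coefficient were to flip in one category, the overall slope would be hiding a group-specific pattern.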
It could be said that, as alcohol content increases and density decreases, the quality seems to drop. However, the ggpairs result tells a completely different story about alcohol: an increase in alcohol seems to affect the quality of the wine positively.
Let’s look at some plots that I find interesting
The above plot suggests that the higher the pH and the lower the fixed.acidity, the better the quality of the wine. One may think that pH is a good measure for determining the quality of wine; however, in the ggpairs plot drawn above, the correlation is not large enough to be considered. In fact, the correlation is negative.
Note: the pH value is mostly in the range of 3 to 4.
# Percentage of observations whose pH lies strictly between 3 and 4
paste(round(mean(wine$pH > 3 & wine$pH < 4) * 100), '%')
## [1] "98 %"
So, out of the total 1599 observations, approximately 98% have a pH value in the range 3-4.
The ggpairs plot also shows sulphates to have a positive correlation with quality; however, its relation with the other variables is barely visible. Let us plot it against one of the variables it has some correlation with, i.e. alcohol, and also check the plot of volatile.acidity against alcohol.
The above plots suggest that alcohol and sulphates are among the major contributors to increasing the quality of wine, as discussed before. Also, the best-quality wines have lower amounts of volatile.acidity and higher amounts of alcohol.
Since volatile.acidity had a considerable correlation in the ggpairs plot, let’s check its distribution, as well as its relation with some of the chemical compounds in the dataset.
It can be deduced from the above plots that, no matter the other compounds, volatile acidity needs to be lower for a good-quality wine.
Overall, after observing all the plots, we could see that alcohol and sulphates tend to increase the quality of wine, whereas pH and volatile.acidity decrease it.
The dataset that we just analyzed was fairly small, and hence this cannot be a perfect solution for determining the quality of red wine. Another point to consider is that the quality ratings given here must have been assigned by experts, and such ratings may vary depending on the geographical region the experts are from. In short, the information about the experts is hidden from us, and hence the results obtained through plots should not be taken as fully accurate.
I would also emphasize the use of a learning algorithm, as it would be faster and, if another subset of data is added, would be able to find correlations better than we can by visually inspecting each dependent and independent variable. One problem I faced was having to analyze almost all the variables one by one and discard them when nothing was clear or evident. A programmatic model would be best suited to this scenario, where we set a criterion that determines whether a combination of variables is included for further analysis. We could always feed in new criteria and various combinations of them as well. A basic example could be:
- Include only those variables that have a correlation of more than 0.4 or less than -0.4 with quality
- Then also include a variable if it is correlated with another variable from the set
- Try the same algorithm excluding the medium range (5-6) of the quality measure, i.e. only for low and high, and check whether the correlation coefficient changes
There could be more, but these are the simplest examples that come to mind.
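The first and third rules above could be sketched as follows. This is only an illustration on simulated data: `select_by_correlation()` is a hypothetical helper, the 0.4 threshold comes from the list above, and the second rule (adding variables correlated with the selected set) is left out to keep the example short.

```r
# Hypothetical helper: keep variables whose correlation with quality
# exceeds a threshold in absolute value (0.4 as suggested above)
select_by_correlation <- function(data, target = "quality", threshold = 0.4) {
  predictors <- setdiff(names(data), target)
  r <- sapply(predictors, function(v) cor(data[[v]], data[[target]]))
  r[abs(r) > threshold]
}

set.seed(11)
# Toy data (simulated): alcohol helps, volatile.acidity hurts, pH is noise
n <- 500
alcohol          <- rnorm(n, mean = 10.4, sd = 1)
volatile.acidity <- rnorm(n, mean = 0.53, sd = 0.18)
score <- 5.6 + 0.6 * (alcohol - 10.4) -
  3 * (volatile.acidity - 0.53) + rnorm(n, sd = 0.5)
toy <- data.frame(
  alcohol          = alcohol,
  volatile.acidity = volatile.acidity,
  pH               = rnorm(n, mean = 3.3, sd = 0.15),
  quality          = pmin(8, pmax(3, round(score)))  # integer scores in 3..8
)

# Rule 1: variables with |correlation with quality| > 0.4
selected <- select_by_correlation(toy)

# Rule 3: repeat with the medium band (5-6) excluded
extremes <- toy[toy$quality <= 4 | toy$quality >= 7, ]
selected_extremes <- select_by_correlation(extremes)
```

Comparing `selected` with `selected_extremes` shows whether dropping the medium samples changes which variables pass the threshold and how large their coefficients are.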
On the analysis front, prediction is another powerful tool that can be used to gain more insight. For that, we may again have to train algorithms on large datasets that include information about the experts (geography, age, sex, etc.) along with other factors that may determine quality, such as smell and the character of the color (bright, pale, etc.).
EDA is perhaps the best tool to understand the data and get a feel for it. It is useful when you want to visualize data even before thinking of models and writing code. However, it has some limitations, since it depends on the collected data (i.e. the sample).